Introduction

In a volatile economy where every dollar counts, the used car market represents a critical sector of consumer spending. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by pricing their vehicles accurately. Bridging this gap requires a data-driven approach that uncovers the relationships between vehicle specifications and market pricing.

This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle.

This dataset aggregates detailed information on more than 8,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type and ownership history.

Our analysis is driven by three core objectives:

Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.

By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.

The dataset contains the following attributes:

   Attribute Name  Description
 1 name            The make and model of the vehicle (e.g., Hyundai i10, Honda City).
 2 year            The year the vehicle was manufactured.
 3 selling_price   The selling price of the vehicle in Indian Rupees (INR).
 4 km_driven       The total distance the vehicle has been driven (in km).
 5 fuel            The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”).
 6 seller_type     The type of seller (“Individual”, “Dealer”, or “Trustmark Dealer”).
 7 transmission    The type of transmission system (“Manual” or “Automatic”).
 8 owner           The number of previous owners (“First Owner”, “Second Owner”, “Third Owner”, “Fourth & Above Owner”, or “Test Drive Car”).
 9 mileage         The fuel efficiency of the vehicle (in either km/l or km/kg).
10 engine          The engine displacement capacity (in CC).
11 max_power       The maximum power output of the vehicle’s engine, in brake horsepower (bhp).
12 torque          The vehicle’s maximum torque as a torque/RPM pair; torque is in Nm or kgm, and RPM is a single value or a range.
13 seats           The seating capacity of the vehicle as an integer.

The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.

Attribute Type                Attribute Names
Categorical Variables         name, fuel, seller_type, transmission, owner
Discrete Numeric Variables    year, km_driven, seats
Continuous Numeric Variables  selling_price, mileage, engine, max_power, torque
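
Once the dataset is loaded (as in the Methods section below), this split can be checked programmatically. A minimal sketch, assuming the raw data frame is named car_details; note that several attributes (mileage, engine, max_power, torque) arrive as text with units and only become numeric after the cleaning steps that follow:

```r
# Columns stored as character strings (categorical variables, plus
# numeric-with-units columns that still need cleaning)
names(car_details)[sapply(car_details, is.character)]

# Columns already stored as numeric
names(car_details)[sapply(car_details, is.numeric)]
```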

Methods

# Load required packages
library(tidyverse)  # read_csv(), word(), tibbles
library(lmtest)     # bptest()
library(car)        # vif()

# Set document-wide options 
options(tibble.width = Inf)  # Make tibbles display all columns

Load dataset

car_details = read_csv("Car details v3.csv")

Total number of observations

nrow(car_details)
## [1] 8128

Sample of data from Vehicle dataset

head(car_details)
## # A tibble: 6 × 13
##   name                           year selling_price km_driven fuel   seller_type
##   <chr>                         <dbl>         <dbl>     <dbl> <chr>  <chr>      
## 1 Maruti Swift Dzire VDI         2014        450000    145500 Diesel Individual 
## 2 Skoda Rapid 1.5 TDI Ambition   2014        370000    120000 Diesel Individual 
## 3 Honda City 2017-2020 EXi       2006        158000    140000 Petrol Individual 
## 4 Hyundai i20 Sportz Diesel      2010        225000    127000 Diesel Individual 
## 5 Maruti Swift VXI BSIII         2007        130000    120000 Petrol Individual 
## 6 Hyundai Xcent 1.2 VTVT E Plus  2017        440000     45000 Petrol Individual 
##   transmission owner        mileage    engine  max_power 
##   <chr>        <chr>        <chr>      <chr>   <chr>     
## 1 Manual       First Owner  23.4 kmpl  1248 CC 74 bhp    
## 2 Manual       Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual       Third Owner  17.7 kmpl  1497 CC 78 bhp    
## 4 Manual       First Owner  23.0 kmpl  1396 CC 90 bhp    
## 5 Manual       First Owner  16.1 kmpl  1298 CC 88.2 bhp  
## 6 Manual       First Owner  20.14 kmpl 1197 CC 81.86 bhp 
##   torque                   seats
##   <chr>                    <dbl>
## 1 190Nm@ 2000rpm               5
## 2 250Nm@ 1500-2500rpm          5
## 3 12.7@ 2,700(kgm@ rpm)        5
## 4 22.4 kgm at 1750-2750rpm     5
## 5 11.5@ 4,500(kgm@ rpm)        5
## 6 113.75nm@ 4000rpm            5

Data cleaning

Excluding missing values from dataset

car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906

Excluding duplicate rows from dataset

sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717

Transforming and renaming variables

Transforming name variable
# Extract the first word from "name" to get the make of the vehicle
# (note: multi-word makes such as "Land Rover" are truncated to their
# first word, e.g. "Land")
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatype of "make" from character to factor
car_details$make = as.factor(car_details$make)
Transforming mileage variable
# View raw data for "mileage" variable
head(car_details$mileage, n=20)
##  [1] "23.4 kmpl"  "21.14 kmpl" "17.7 kmpl"  "23.0 kmpl"  "16.1 kmpl" 
##  [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl"  "23.59 kmpl" "20.0 kmpl" 
## [11] "19.01 kmpl" "17.3 kmpl"  "19.3 kmpl"  "18.9 kmpl"  "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl"  "19.7 kmpl"  "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage" 
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)

# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"

# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
##  [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50

Above, the mileage variable is renamed to fuel_efficiency to prevent confusion with the km_driven variable, which represents the quantity normally referred to as vehicle “mileage”.

Transforming engine variable
# View raw data for "engine" variable
head(car_details$engine, n=20)
##  [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC" 
##  [8] "1364 CC" "1399 CC" "1461 CC" "993 CC"  "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC"  "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
##  [1] 1248 1498 1497 1396 1298 1197  796 1364 1399 1461  993 1248 1061 1198 1248
## [16] 1396  796 1396 1461 1498
Transforming max_power variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
##  [1] "74 bhp"     "103.52 bhp" "78 bhp"     "90 bhp"     "88.2 bhp"  
##  [6] "81.86 bhp"  "37 bhp"     "67.1 bhp"   "68.1 bhp"   "108.45 bhp"
## [11] "60 bhp"     "73.9 bhp"   "67 bhp"     "82 bhp"     "88.5 bhp"  
## [16] "90 bhp"     "46.3 bhp"   "88.73 bhp"  "64.1 bhp"   "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
##  [1]  74.00 103.52  78.00  90.00  88.20  81.86  37.00  67.10  68.10 108.45
## [11]  60.00  73.90  67.00  82.00  88.50  90.00  46.30  88.73  64.10  98.60

Type conversion of variables

Converting character-typed columns to factor type
# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor 
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
Converting km_driven to km_driven_in_10k
# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k" 
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
##   km_driven_in_10k
##              <dbl>
## 1             14.6
## 2             12  
## 3             14  
## 4             12.7
## 5             12  
## 6              4.5
Converting selling_price to selling_price_in_10k
# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
##   selling_price_in_10k
##                  <dbl>
## 1                 45  
## 2                 37  
## 3                 15.8
## 4                 22.5
## 5                 13  
## 6                 44
Dropping torque column
# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]

Creating make_category variable from make variable

unique(car_details$make)
##  [1] Maruti        Skoda         Honda         Hyundai       Toyota       
##  [6] Ford          Renault       Mahindra      Tata          Chevrolet    
## [11] Datsun        Jeep          Mercedes-Benz Mitsubishi    Audi         
## [16] Volkswagen    BMW           Nissan        Lexus         Jaguar       
## [21] Land          MG            Volvo         Daewoo        Kia          
## [26] Fiat          Force         Ambassador    Ashok         Isuzu        
## [31] Opel         
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
  • From the above result, we can see that there are 31 unique values for “make”.
  • If we model the data using “make” as a predictor, there will be at least 30 dummy variables in the model.
  • To reduce model complexity and improve interpretability, the car makes can be grouped into broader categories (“Budget”, “Midrange”, or “Luxury”) based on general market perception.
# Create new column "make_category" from "make" column
budget_makes   = c("Maruti", "Tata", "Mahindra", "Datsun", "Renault",
                   "Chevrolet", "Fiat", "Daewoo", "Ambassador", "Ashok")
midrange_makes = c("Honda", "Ford", "Hyundai", "Toyota", "Volkswagen", "Nissan",
                   "Skoda", "Mitsubishi", "Force", "Kia", "MG")
car_details$make_category = ifelse(car_details$make %in% budget_makes, "Budget",
                            ifelse(car_details$make %in% midrange_makes, "Midrange",
                                   "Luxury"))

# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)

# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget"   "Luxury"   "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)  
##           make make_category
##         Maruti        Budget
##          Skoda      Midrange
##          Honda      Midrange
##        Hyundai      Midrange
##         Toyota      Midrange
##           Ford      Midrange
##        Renault        Budget
##       Mahindra        Budget
##           Tata        Budget
##      Chevrolet        Budget
##         Datsun        Budget
##           Jeep        Luxury
##  Mercedes-Benz        Luxury
##     Mitsubishi      Midrange
##           Audi        Luxury
##     Volkswagen      Midrange
##            BMW        Luxury
##         Nissan      Midrange
##          Lexus        Luxury
##         Jaguar        Luxury
##           Land        Luxury
##             MG      Midrange
##          Volvo        Luxury
##         Daewoo        Budget
##            Kia      Midrange
##           Fiat        Budget
##          Force      Midrange
##     Ambassador        Budget
##          Ashok        Budget
##          Isuzu        Luxury
##           Opel        Luxury

Final structure of the car_details dataset

Attribute Name        Description                                                                  Variable Type
make                  The manufacturer of the vehicle (e.g., Maruti, Hyundai, Honda).              Factor
year                  The year the vehicle was manufactured.                                       Numeric
selling_price_in_10k  The selling price of the vehicle in ten-thousands of Indian Rupees (INR).   Numeric
km_driven_in_10k      The total distance driven, in ten-thousands of kilometers.                   Numeric
fuel                  The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”).  Factor
seller_type           The type of seller (“Individual”, “Dealer”, or “Trustmark Dealer”).          Factor
transmission          The type of transmission system (“Manual” or “Automatic”).                   Factor
owner                 The number of previous owners (“First Owner”, “Second Owner”, “Third Owner”, “Fourth & Above Owner”, or “Test Drive Car”).  Factor
fuel_efficiency       The fuel efficiency of the vehicle in km/l.                                  Numeric
engine                The engine displacement capacity in cubic centimeters (CC).                  Numeric
max_power             The maximum power output of the engine, in brake horsepower (bhp).           Numeric
seats                 The seating capacity of the vehicle.                                         Numeric
make_category         Categorized vehicle make (“Budget”, “Midrange”, or “Luxury”).                Factor

Sample of final dataset

head(car_details, n=8)
## # A tibble: 8 × 13
##   make     year selling_price_in_10k km_driven_in_10k fuel   seller_type
##   <fct>   <dbl>                <dbl>            <dbl> <fct>  <fct>      
## 1 Maruti   2014                 45               14.6 Diesel Individual 
## 2 Skoda    2014                 37               12   Diesel Individual 
## 3 Honda    2006                 15.8             14   Petrol Individual 
## 4 Hyundai  2010                 22.5             12.7 Diesel Individual 
## 5 Maruti   2007                 13               12   Petrol Individual 
## 6 Hyundai  2017                 44                4.5 Petrol Individual 
## 7 Maruti   2001                  4.5              0.5 Petrol Individual 
## 8 Toyota   2011                 35                9   Diesel Individual 
##   transmission owner        fuel_efficiency engine max_power seats make_category
##   <fct>        <fct>                  <dbl>  <dbl>     <dbl> <dbl> <fct>        
## 1 Manual       First Owner             23.4   1248      74       5 Budget       
## 2 Manual       Second Owner            21.1   1498     104.      5 Midrange     
## 3 Manual       Third Owner             17.7   1497      78       5 Midrange     
## 4 Manual       First Owner             23     1396      90       5 Midrange     
## 5 Manual       First Owner             16.1   1298      88.2     5 Budget       
## 6 Manual       First Owner             20.1   1197      81.9     5 Midrange     
## 7 Manual       Second Owner            16.1    796      37       4 Budget       
## 8 Manual       First Owner             23.6   1364      67.1     5 Midrange

Data Analysis

Variable Distribution Analysis

car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details = subset(car_details, subset = car_details$selling_price_in_10k < 720)
  • One observation, with selling_price_in_10k = 1000 (10 million INR), deviates sharply from the general pattern of selling prices. Because this outlier could distort the model, it is excluded.
Selling Price distribution
# Histogram of selling prices 
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
     main="Selling Price distribution",
     breaks = 30, 
     col = "lightblue")

  • The above plot is positively skewed: most observations have selling prices at or below 2 million INR (200 on the 10,000-INR scale), with far fewer observations above that amount.
Frequency Distribution of Categorical Variables
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)

par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
        xlab = "Number of Cars",
        ylab = "Car Make",
        col = "lightblue",
        cex.names = 0.5,
        main = "Num. of Cars in Each Make")

# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")

# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")

# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")

# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")

Boxplots of Selling Price vs Categorical Variables

Selling Price by Car Make
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
        las = 2,cex.axis = 0.7,               
        main = "Boxplot of Selling Price by Car Make",
        xlab = "Car Make",
        ylab = "Selling Price (10,000 INR)")

  • The above boxplots indicate substantial variation in selling prices across car makes; some brands have much wider price ranges than others.
  • Brands like Mercedes-Benz, BMW, Jaguar, Land Rover (coded as “Land”), and Volvo have the highest median selling prices, while Tata, Maruti, and Daewoo have the lowest.
Selling Price by Fuel Type
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
        col = rainbow(length(unique(car_details$fuel))),
        main = "Box Plot of Selling Price by Fuel Type",
        xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")

  • From the above boxplot, we can see that the median selling price of Diesel cars is slightly higher than that of Petrol cars.
Selling Price by Seller Type
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
        col = rainbow(length(unique(car_details$seller_type))),
        main = "Box Plot of Selling Price by Seller Type",
        xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")

  • Cars sold by individuals have the lowest median price compared to cars sold by dealers and Trustmark dealers.
Selling Price by Transmission Type
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
     col = rainbow(length(unique(car_details$transmission))),
     main = "Boxplot of Selling Price by Transmission Type",
     xlab = "Transmission Type", 
     ylab = "Selling Price (10,000 INR)")

  • We can see a clear difference in selling price between automatic and manual transmission cars: automatic cars occupy a higher price range than manual ones.
Selling Price by Owner Type
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
     col = rainbow(length(unique(car_details$owner))),
     main = "Boxplot of Selling Price vs Owner Type",
     xlab = "Owner Type", 
     ylab = "Selling Price (10,000 INR)")

  • Selling prices differ noticeably across owner types.
  • The median price of test drive cars is very high, while the other owner types have much lower medians.
  • Median selling price decreases in the order First Owner, Second Owner, Third Owner, Fourth & Above Owner.

Scatter Plots of Selling Price vs Numerical Variables

Selling Price vs Year and Transmission Type
# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
     main = "Selling Price vs Year and Transmission Type",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
       col = c("blue", "red"), pch = 19)

  • We can see a positive correlation between year and selling price: as year increases, so does selling price.
  • This suggests that newer cars are priced higher than older ones.
  • We can also see that automatic transmission cars have higher selling prices across all years.
Selling Price vs Year and Make Category
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange" = "green", "Luxury" = "orange")
# Index the colour map by name (as.character), not by factor code, so each
# category gets its intended colour
plot(selling_price_in_10k ~ year, data = car_details,
     col = colours[as.character(car_details$make_category)], pch = 19,
     main = "Selling Price vs Year and Make Category",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
       col = c("blue", "green","orange"), pch = 19)

  • As in the previous plot, there is a positive correlation between year and selling price.
  • The prices of budget, midrange, and luxury cars all increase with year.
Selling Price vs Km Driven
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven")

# Subset out extreme values
car_details = subset(car_details, car_details$km_driven_in_10k < 100)

# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven (extreme values removed)")

  • The relationship between selling price and kilometers driven does not appear strongly linear.
  • However, as kilometers driven increases, the selling price stays in a low range.
  • Two observations deviate from the general pattern, with km_driven_in_10k values of 150.0 and 236.0; these can be seen in the first plot above.
  • Because these points could distort the model, the two data points are excluded before the second plot.
Selling Price vs Fuel Efficiency
# Subset out 15 observations with fuel_efficiency=0 
car_details = subset(car_details, car_details$fuel_efficiency != 0)

# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
     xlab="Fuel Efficiency (km/l)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Fuel Efficiency")

  • Most data points are clustered at fuel efficiency values of 10 to 30 km/l, and there does not appear to be a linear relationship.
  • That is, higher fuel efficiency does not imply a higher selling price.
Selling Price vs Engine Displacement
# Scatter plot of selling price vs engine displacement
plot(selling_price_in_10k ~ engine, data=car_details,
     xlab="Engine Displacement (CC)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Engine Displacement")

  • From the above plot, we can see that selling prices are higher for larger engine displacements.
Selling Price vs Max Power
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel" = "green")
# Index the colour map by name (as.character), not by factor code; CNG and
# LPG cars, which have no mapped colour, are left uncoloured
plot(selling_price_in_10k ~ max_power, data = car_details,
     col = colours[as.character(car_details$fuel)],
     main="Selling Price vs Max Power (by Fuel Type)",
     xlab="Max Power (bhp)",
     ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
       col = c("blue", "green"), pch = 19)

  • From the above plot, we can see that there is a roughly linear relationship between max power and selling price.
  • As maximum power increases, the selling price of the car increases.
  • In general, diesel-powered cars have higher selling prices.
    • In this plot, however, the petrol-powered cars with the highest horsepower also reach the highest selling prices.
Conclusions of Variable Distribution Analysis
  • The selling price distribution is positively skewed; such extreme values make models harder to fit.
    • A logarithmic transformation can bring the selling price distribution closer to normal.
    • Logarithmic transformations can also help stabilize the variance, making the data more homoscedastic and better suited to analysis.
  • There is a positive correlation between year and selling price.
  • The selling price also tends to increase with engine displacement and max power.
  • Prices of budget, midrange, and luxury cars all increase with year.
  • As kilometers driven increase, the selling price tends to decrease.
  • Fuel efficiency shows no clear effect on selling price.
  • The median price of automatic transmission cars is higher than that of manual transmission cars.
  • The median price of test drive cars is very high compared to the other owner types.
  • Prices of diesel cars are generally high, but petrol cars with high horsepower also command high prices.
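
As a quick illustration of the suggested transformation, the selling-price histogram can be redrawn on the log scale (a sketch only; the transformed variable is not used in the analysis below):

```r
# Histogram of log-transformed selling prices: the log compresses the long
# right tail, yielding a more symmetric, roughly bell-shaped distribution
hist(log(car_details$selling_price_in_10k),
     xlab = "log(Selling Price in 10,000 INR)",
     main = "Log-Transformed Selling Price Distribution",
     breaks = 30,
     col = "lightblue")
```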

Analysis of Correlation Between Numeric Variables

# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year + km_driven_in_10k + fuel_efficiency + engine + max_power + seats, data = car_details)

# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[, sapply(car_details, is.numeric)])
cor_mat
##                             year selling_price_in_10k km_driven_in_10k
## year                  1.00000000            0.4395767      -0.45496390
## selling_price_in_10k  0.43957666            1.0000000      -0.19812586
## km_driven_in_10k     -0.45496390           -0.1981259       1.00000000
## fuel_efficiency       0.37352047           -0.1258892      -0.24186898
## engine               -0.02333154            0.4480476       0.30860425
## max_power             0.15993052            0.6857687       0.04886655
## seats                 0.02630866            0.1648269       0.24955889
##                      fuel_efficiency      engine   max_power       seats
## year                       0.3735205 -0.02333154  0.15993052  0.02630866
## selling_price_in_10k      -0.1258892  0.44804759  0.68576871  0.16482693
## km_driven_in_10k          -0.2418690  0.30860425  0.04886655  0.24955889
## fuel_efficiency            1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine                    -0.5990946  1.00000000  0.68399138  0.66312580
## max_power                 -0.4007856  0.68399138  1.00000000  0.25911446
## seats                     -0.4855295  0.66312580  0.25911446  1.00000000
# Extract high positive correlations (> 0.5) from the upper triangle of the
# correlation matrix (the upper triangle removes redundant correlations
# caused by symmetry)
high_cor_indices = which(upper.tri(cor_mat) & cor_mat > 0.5, arr.ind = TRUE)

# Create table for high-correlation variables
high_cor_df = data.frame(
  row = rownames(cor_mat)[high_cor_indices[, 1]],
  column = colnames(cor_mat)[high_cor_indices[, 2]],
  correlation = cor_mat[high_cor_indices]
)

# Display the first few entries in the table
head(high_cor_df)
##                    row    column correlation
## 1 selling_price_in_10k max_power   0.6857687
## 2               engine max_power   0.6839914
## 3               engine     seats   0.6631258
  • From the above table of high-correlation variable pairs, we can see that the following pairs have correlations above 0.5:
    • max_power and selling_price_in_10k have a high correlation of 0.686, so selling price tends to increase with max power.
    • max_power and engine have a high correlation of 0.684, indicating that engine displacement and max power are collinear.
    • seats and engine have a high correlation of 0.663, indicating that engine displacement and seating capacity are collinear.
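
Note that the extraction above keeps only positive correlations. A small variant using abs() (a sketch) would also surface strong negative pairs, such as fuel_efficiency and engine (-0.60) from the matrix printed above:

```r
# Same extraction as above, but on absolute correlations, so that strong
# negative relationships are flagged as well
high_abs_indices = which(upper.tri(cor_mat) & abs(cor_mat) > 0.5, arr.ind = TRUE)
data.frame(
  row = rownames(cor_mat)[high_abs_indices[, 1]],
  column = colnames(cor_mat)[high_abs_indices[, 2]],
  correlation = cor_mat[high_abs_indices]
)
```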

Model Development and Validation

In this section, a model for predicting values of selling_price based on the values of the other variables is developed and the performance of the model is analyzed.

Splitting data into train and test sets

set.seed(125) 
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]

# Remove `make` from train_data and test_data.
# "make" has 31 unique levels, so modeling with it directly would require at
# least 30 dummy variables; to keep the model simpler, the broader
# "make_category" variable is used instead.
train_data <- subset(train_data, select = -make)
test_data <- subset(test_data, select = -make)

nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323

Full additive linear model

# Definition of "full" linear model
full_model = lm(selling_price_in_10k ~ ., data = train_data)

# Model summary
summary(full_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212.47  -11.13   -1.44    8.25  321.72 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -7.830e+03  2.819e+02 -27.777  < 2e-16 ***
## year                         3.902e+00  1.414e-01  27.608  < 2e-16 ***
## km_driven_in_10k            -8.666e-01  9.579e-02  -9.047  < 2e-16 ***
## fuelPetrol                  -9.928e+00  1.202e+00  -8.261  < 2e-16 ***
## seller_typeIndividual       -8.391e+00  1.298e+00  -6.462 1.12e-10 ***
## seller_typeTrustmark Dealer -5.155e+00  5.535e+00  -0.931  0.35168    
## transmissionManual          -1.522e+01  1.542e+00  -9.869  < 2e-16 ***
## ownerFourth & Above Owner   -2.120e+00  2.663e+00  -0.796  0.42610    
## ownerSecond Owner           -4.924e+00  9.240e-01  -5.328 1.03e-07 ***
## ownerTest Drive Car          2.483e+02  1.217e+01  20.405  < 2e-16 ***
## ownerThird Owner            -3.582e+00  1.539e+00  -2.328  0.01995 *  
## fuel_efficiency             -3.249e-01  1.809e-01  -1.796  0.07249 .  
## engine                       9.718e-03  1.811e-03   5.366 8.41e-08 ***
## max_power                    4.635e-01  2.086e-02  22.224  < 2e-16 ***
## seats                        1.160e+00  6.149e-01   1.886  0.05935 .  
## make_categoryLuxury          1.034e+02  3.115e+00  33.195  < 2e-16 ***
## make_categoryMidrange        2.517e+00  8.482e-01   2.968  0.00301 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.9 on 5272 degrees of freedom
## Multiple R-squared:  0.7122, Adjusted R-squared:  0.7113 
## F-statistic: 815.4 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(full_model))
## [1] 17
#Leave one out cross validation RMSE 
sqrt(mean((resid(full_model) / (1 - hatvalues(full_model))) ^ 2))
## [1] 27.30411
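The one-line computation above uses the standard leverage shortcut for linear models, which yields the exact leave-one-out RMSE without refitting the model n times:

$$\text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2}$$

where $e_i$ is the $i$-th residual and $h_{ii}$ is the $i$-th leverage (the diagonal of the hat matrix, returned by hatvalues()). The same expression is reused for every model below.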
# Diagnostic plots
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))

# Diagnostic tests
bp_test <- bptest(full_model)
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  full_model
## BP = 1310, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(full_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.71385, p-value < 2.2e-16
# Multicollinearity test
vif(full_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.184945  1        1.478156
## km_driven_in_10k 1.643485  1        1.281985
## fuel             2.603160  1        1.613431
## seller_type      1.117854  2        1.028244
## transmission     1.404885  1        1.185278
## owner            1.407956  4        1.043695
## fuel_efficiency  3.679447  1        1.918188
## engine           5.853557  1        2.419412
## max_power        3.175790  1        1.782075
## seats            2.684514  1        1.638449
## make_category    1.929111  2        1.178526
Results of the full additive linear model
  • There are 17 parameters in the full additive linear model, which is relatively high.

  • The R^2 value of the model is 0.7122; that is, 71.22% of the variance in selling_price_in_10k is explained by the predictors.

  • All predictors except engine (VIF = 5.85) have a Variance Inflation Factor (VIF) below 5, indicating no serious multicollinearity among the predictors.

  • The Breusch-Pagan test rejects the null hypothesis (BP = 1310, p-value < 2.2e-16), so the constant variance assumption is violated.

  • The Shapiro-Wilk test rejects the null hypothesis (W = 0.71385, p-value < 2.2e-16), so the normality assumption is violated.

  • The LOOCV RMSE of the model is 27.30411, which is low when compared to the range of values for the response variable selling price.

  • At α = 0.05, all individual predictors are significant except for “seller_typeTrustmark Dealer” (p-value = 0.352), “ownerFourth & Above Owner” (p-value = 0.426), “fuel_efficiency” (p-value = 0.072), and “seats” (p-value = 0.059).

Small linear additive model

  • In the full model above, the terms for “seller_type” (seller_typeTrustmark Dealer, p-value = 0.352) and “seats” (p-value = 0.059) are not significant at α = 0.05.
  • The following small additive model is built with all the predictors from the full model except the non-significant predictors seller_type and seats.
small_add_model = lm(selling_price_in_10k ~ . -seller_type -seats , data = train_data)

# Model summary
summary(small_add_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ . - seller_type - seats, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.71  -11.03   -1.54    8.09  326.48 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -7.953e+03  2.722e+02 -29.214  < 2e-16 ***
## year                       3.963e+00  1.361e-01  29.129  < 2e-16 ***
## km_driven_in_10k          -9.068e-01  9.577e-02  -9.468  < 2e-16 ***
## fuelPetrol                -1.032e+01  1.194e+00  -8.641  < 2e-16 ***
## transmissionManual        -1.583e+01  1.536e+00 -10.307  < 2e-16 ***
## ownerFourth & Above Owner -2.772e+00  2.670e+00  -1.038  0.29924    
## ownerSecond Owner         -5.537e+00  9.231e-01  -5.998 2.13e-09 ***
## ownerTest Drive Car        2.538e+02  1.219e+01  20.822  < 2e-16 ***
## ownerThird Owner          -4.388e+00  1.540e+00  -2.850  0.00439 ** 
## fuel_efficiency           -3.962e-01  1.725e-01  -2.297  0.02163 *  
## engine                     1.111e-02  1.656e-03   6.708 2.17e-11 ***
## max_power                  4.640e-01  2.035e-02  22.800  < 2e-16 ***
## make_categoryLuxury        1.031e+02  3.063e+00  33.666  < 2e-16 ***
## make_categoryMidrange      2.429e+00  8.085e-01   3.004  0.00268 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.01 on 5275 degrees of freedom
## Multiple R-squared:  0.7097, Adjusted R-squared:  0.709 
## F-statistic: 992.1 on 13 and 5275 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(small_add_model))
## [1] 14
#Leave one out cross validation RMSE 
sqrt(mean((resid(small_add_model) / (1 - hatvalues(small_add_model))) ^ 2))
## [1] 27.39708
# Diagnostic plots
par(mfrow = c(1, 2))
plot(small_add_model,which=c(1,2))

# Diagnostic tests
bptest(small_add_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  small_add_model
## BP = 1308.8, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(small_add_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.70295, p-value < 2.2e-16
# Multicollinearity test
vif(small_add_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.008053  1        1.417058
## km_driven_in_10k 1.629810  1        1.276640
## fuel             2.550651  1        1.597076
## transmission     1.381643  1        1.175433
## owner            1.381414  4        1.041215
## fuel_efficiency  3.319208  1        1.821869
## engine           4.857050  1        2.203872
## max_power        2.999960  1        1.732039
## make_category    1.712363  2        1.143929
Results of the small additive linear model
  • There are 14 parameters used in the small additive linear model.

  • The R^2 value of the model is 0.7097, which is slightly less than the R^2 value of the full model. That is, 70.97% of the variance in selling_price_in_10k is explained by the predictors.

  • All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.

  • The Breusch-Pagan test rejects the null hypothesis (BP = 1308.8, p-value < 2.2e-16), so the constant variance assumption is violated.

  • The Shapiro-Wilk test rejects the null hypothesis (W = 0.70295, p-value < 2.2e-16), so the normality assumption is violated.

  • The LOOCV RMSE of the model is 27.39708, which is low when compared to the range of values for the response variable selling price.

  • All the individual predictors are significant at α = 0.05 except for “ownerFourth & Above Owner” (p-value = 0.29924); note that fuel_efficiency (p-value = 0.02163) is significant at this level.

ANOVA of Small and Full additive models

anova(small_add_model, full_model)
## Analysis of Variance Table
## 
## Model 1: selling_price_in_10k ~ (year + km_driven_in_10k + fuel + seller_type + 
##     transmission + owner + fuel_efficiency + engine + max_power + 
##     seats + make_category) - seller_type - seats
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type + 
##     transmission + owner + fuel_efficiency + engine + max_power + 
##     seats + make_category
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   5275 3848848                                  
## 2   5272 3816027  3     32820 15.114 8.584e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • From the above Analysis of Variance (ANOVA) result, the full additive model is statistically better than the small additive model, based on the high F statistic (15.114), the low p-value (8.584e-10), and the lower RSS.

  • The full additive model has a lower RSS (3816027) than the small additive model (3848848), indicating that the full additive model fits the training data better.

  • The low p-value indicates that the small additive model is NOT sufficient in explaining the variance of the response variable.

  • Hence we reject the null hypothesis H0: the small additive model is sufficient in explaining the variance of the response variable.

  • From the ANOVA result we conclude that the full additive linear model is better than the small additive linear model.

  • However, in both of the above models the constant variance and normality assumptions of the residuals are violated.

Interaction model | with BIC

  • From the above results we can see that the constant variance assumption is violated for all the above models.
  • Hence there may be room to improve the model, either through interaction terms or through predictor transformations.
  • The following model uses two interaction terms:
    1. max_power*fuel
    2. transmission*seller_type
  1. max_power*fuel: This interaction term is between a numeric predictor, “max_power”, and a categorical predictor, “fuel”. It is included based on the results of the scatter plot “Selling Price vs Max Power (by Fuel Type)” in the data analysis section.

  2. transmission*seller_type: This interaction is between two categorical variables, transmission type and seller type. It is included based on the interaction plots shown below.

  • An interaction plot visualizes how the mean response at the levels of one categorical variable changes across the levels of another.
  • An interaction occurs when the effect of one variable on the response depends on the level of another variable.
  • If the lines on the plot intersect, there is likely an interaction between the variables; if the lines are roughly parallel, there is little evidence of interaction.
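For reference, with a numeric predictor $x$ (e.g. max_power) and a dummy variable $d$ (e.g. fuelPetrol), an interaction model takes the form

$$y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3\,(x \cdot d) + \varepsilon$$

so the slope of $x$ is $\beta_1$ when $d = 0$ and $\beta_1 + \beta_3$ when $d = 1$; a significant $\beta_3$ means the effect of $x$ on the response depends on the level of $d$.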
par(mfrow = c(2, 2))
interaction.plot(train_data$fuel, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$fuel, train_data$transmission, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$fuel, train_data$make_category, train_data$selling_price_in_10k)

par(mfrow = c(2, 2))
interaction.plot(train_data$make_category, train_data$owner, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$make_category, train_data$selling_price_in_10k)
interaction.plot(train_data$make_category, train_data$seller_type, train_data$selling_price_in_10k)
interaction.plot(train_data$seller_type, train_data$owner, train_data$selling_price_in_10k)

par(mfrow = c(1, 2))
interaction.plot(train_data$seller_type, train_data$fuel, train_data$selling_price_in_10k)
interaction.plot(train_data$transmission, train_data$seller_type, train_data$selling_price_in_10k)

  • From the above interaction plots we can see that only the last plot has intersecting lines; the rest have roughly parallel lines.
  • In the last plot, the relationship between seller_type and selling price changes depending on the transmission type.
  • Hence the second interaction term (transmission*seller_type) is included in the following model.
n=nrow(train_data)
bic_int_model = step(lm(selling_price_in_10k ~ year + make_category + max_power*fuel + transmission*seller_type + owner + engine, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_int_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power * 
##     fuel + transmission * seller_type + owner + engine, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -217.563  -10.459   -1.017    8.923  307.953 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                    -8.850e+03  2.256e+02 -39.233
## year                                            4.407e+00  1.118e-01  39.413
## make_categoryLuxury                             9.106e+01  3.044e+00  29.915
## make_categoryMidrange                           3.360e+00  7.940e-01   4.231
## max_power                                       5.972e-01  2.140e-02  27.899
## fuelPetrol                                      2.924e+01  2.463e+00  11.873
## transmissionManual                             -4.403e+01  2.884e+00 -15.270
## seller_typeIndividual                          -3.773e+01  2.798e+00 -13.485
## seller_typeTrustmark Dealer                    -3.012e+01  9.124e+00  -3.302
## ownerFourth & Above Owner                      -3.651e+00  2.597e+00  -1.406
## ownerSecond Owner                              -5.409e+00  8.981e-01  -6.023
## ownerTest Drive Car                             2.596e+02  1.204e+01  21.568
## ownerThird Owner                               -4.007e+00  1.498e+00  -2.676
## engine                                          1.379e-02  1.163e-03  11.855
## max_power:fuelPetrol                           -4.182e-01  2.718e-02 -15.390
## transmissionManual:seller_typeIndividual        3.490e+01  3.108e+00  11.230
## transmissionManual:seller_typeTrustmark Dealer  3.783e+01  1.131e+01   3.345
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## year                                            < 2e-16 ***
## make_categoryLuxury                             < 2e-16 ***
## make_categoryMidrange                          2.36e-05 ***
## max_power                                       < 2e-16 ***
## fuelPetrol                                      < 2e-16 ***
## transmissionManual                              < 2e-16 ***
## seller_typeIndividual                           < 2e-16 ***
## seller_typeTrustmark Dealer                    0.000967 ***
## ownerFourth & Above Owner                      0.159804    
## ownerSecond Owner                              1.83e-09 ***
## ownerTest Drive Car                             < 2e-16 ***
## ownerThird Owner                               0.007477 ** 
## engine                                          < 2e-16 ***
## max_power:fuelPetrol                            < 2e-16 ***
## transmissionManual:seller_typeIndividual        < 2e-16 ***
## transmissionManual:seller_typeTrustmark Dealer 0.000827 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.25 on 5272 degrees of freedom
## Multiple R-squared:  0.726,  Adjusted R-squared:  0.7252 
## F-statistic: 873.1 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_int_model))
## [1] 17
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_int_model) / (1 - hatvalues(bic_int_model))) ^ 2))
## [1] 26.73532
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model,which=c(1,2))

# Diagnostic tests
bptest(bic_int_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_int_model
## BP = 1522.5, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.73061, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                               GVIF Df GVIF^(1/(2*Df))
## year                      1.436331  1        1.198470
## make_category             1.857596  2        1.167449
## max_power                 3.513535  1        1.874443
## fuel                     11.482762  1        3.388623
## transmission              5.157209  1        2.270949
## seller_type              15.303631  2        1.977874
## owner                     1.432764  4        1.045976
## engine                    2.536586  1        1.592666
## max_power:fuel            9.919899  1        3.149587
## transmission:seller_type 28.618175  2        2.312919
Results of interaction model
  • There are 17 parameters in the interaction model after BIC stepwise feature selection, which is relatively high.

  • The R^2 value of the model is 0.726, which is higher than the R^2 value of the full additive model (0.7122). That is, 72.6% of the variance in selling_price_in_10k is explained by the predictors.

  • The variables fuel, transmission, seller_type, and both interaction terms have a GVIF > 5; as the warning from vif() indicates, inflated values are expected for predictors involved in interaction terms.

  • The Breusch-Pagan test rejects the null hypothesis with a very high test statistic (BP = 1522.5, p-value < 2.2e-16), so the constant variance assumption is violated.

  • The Shapiro-Wilk test rejects the null hypothesis (W = 0.73061, p-value < 2.2e-16), so the normality assumption is violated.

  • The LOOCV RMSE of the model is 26.73532, the lowest among the models fitted so far on the original scale.

  • All the individual predictors are significant except for “ownerFourth & Above Owner” (p-value = 0.159804).

Logarithmic transformation of response variable

  • From the Fitted vs Residuals plots of the models discussed above, there is a funnel-shaped pattern in the residuals and the constant variance assumption is violated.
  • A logarithmic transformation can reduce the positive skew of the selling price distribution, bringing it closer to normality and helping to stabilize the variance, making it more suitable for model building.
train_data$selling_price_in_10k = log(train_data$selling_price_in_10k)
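Note that the response is transformed in place, so the models fitted below predict the log of the price; to recover a price on the original scale, a prediction must be exponentiated:

$$\widehat{\log y} = x^{\top}\hat{\beta} \quad\Longrightarrow\quad \hat{y} = \exp\!\left(x^{\top}\hat{\beta}\right)$$

Since the test data has not been transformed at this point, any test-set evaluation must apply this back-transformation first.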

Full Additive linear model(log transformed) | with BIC

# Definition of full additive linear model with BIC-based stepwise feature selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model_log = step(lm(selling_price_in_10k ~ ., data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_full_add_model_log)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + km_driven_in_10k + 
##     fuel + seller_type + transmission + owner + fuel_efficiency + 
##     engine + max_power + seats + make_category, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60851 -0.17245  0.01754  0.19210  1.35612 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.161e+02  3.003e+00 -71.946  < 2e-16 ***
## year                         1.084e-01  1.506e-03  71.974  < 2e-16 ***
## km_driven_in_10k            -4.752e-03  1.020e-03  -4.657 3.28e-06 ***
## fuelPetrol                  -1.456e-01  1.280e-02 -11.374  < 2e-16 ***
## seller_typeIndividual       -5.924e-02  1.383e-02  -4.283 1.88e-05 ***
## seller_typeTrustmark Dealer  7.623e-02  5.896e-02   1.293    0.196    
## transmissionManual          -1.174e-01  1.643e-02  -7.144 1.03e-12 ***
## ownerFourth & Above Owner   -1.192e-01  2.837e-02  -4.201 2.70e-05 ***
## ownerSecond Owner           -7.388e-02  9.843e-03  -7.506 7.13e-14 ***
## ownerTest Drive Car          6.353e-01  1.296e-01   4.901 9.80e-07 ***
## ownerThird Owner            -1.074e-01  1.639e-02  -6.551 6.27e-11 ***
## fuel_efficiency              1.477e-02  1.927e-03   7.669 2.05e-14 ***
## engine                       2.269e-04  1.929e-05  11.759  < 2e-16 ***
## max_power                    7.989e-03  2.222e-04  35.958  < 2e-16 ***
## seats                        7.726e-02  6.550e-03  11.796  < 2e-16 ***
## make_categoryLuxury          4.540e-01  3.318e-02  13.681  < 2e-16 ***
## make_categoryMidrange        1.313e-01  9.036e-03  14.530  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2866 on 5272 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8514 
## F-statistic:  1894 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_full_add_model_log))
## [1] 17
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_full_add_model_log) / (1 - hatvalues(bic_full_add_model_log))) ^ 2))
## [1] 0.2871987
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model_log,which=c(1,2))

# Diagnostic tests
bp_test = bptest(bic_full_add_model_log)
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_full_add_model_log
## BP = 387.08, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model_log), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98726, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model_log)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.184945  1        1.478156
## km_driven_in_10k 1.643485  1        1.281985
## fuel             2.603160  1        1.613431
## seller_type      1.117854  2        1.028244
## transmission     1.404885  1        1.185278
## owner            1.407956  4        1.043695
## fuel_efficiency  3.679447  1        1.918188
## engine           5.853557  1        2.419412
## max_power        3.175790  1        1.782075
## seats            2.684514  1        1.638449
## make_category    1.929111  2        1.178526
Results of Full additive model with log transformation
  • There are 17 parameters in the full additive log-transformed model after BIC stepwise feature selection, which is relatively high.

  • The full additive model after log transformation has R^2 = 0.8518, indicating that 85.18% of the variance in log(selling_price_in_10k) is explained by the predictors.

  • All predictors except engine (VIF = 5.85) have a Variance Inflation Factor (VIF) below 5, indicating no serious multicollinearity among the predictors.

  • The Breusch-Pagan test rejects the null hypothesis (BP = 387.08, p-value < 2.2e-16), so the constant variance assumption is still violated, although the test statistic is much smaller than before the transformation.

  • The Shapiro-Wilk test still rejects the null hypothesis (W = 0.98726, p-value < 2.2e-16), but W is now close to 1, so the residuals are much closer to normal than before the transformation.

  • The LOOCV RMSE of the model is 0.2871987. Note that this is on the log scale and is therefore not directly comparable with the RMSE values of the untransformed models.

  • All the individual predictors are significant except for “seller_typeTrustmark Dealer” (p-value = 0.196).

Small Additive linear model(log transformed) | with BIC

  • The initial small additive model is built with the features that were found, through data analysis, to be significant in deciding the selling price of the car.
  • BIC stepwise feature selection is then applied.
bic_small_add_model_log = step(lm(selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner +engine, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_small_add_model_log)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power + 
##     transmission + fuel + seller_type + owner + engine, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59461 -0.17061  0.01788  0.19184  1.33769 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.348e+02  2.493e+00 -94.197  < 2e-16 ***
## year                         1.180e-01  1.236e-03  95.497  < 2e-16 ***
## make_categoryLuxury          3.908e-01  3.299e-02  11.845  < 2e-16 ***
## make_categoryMidrange        9.883e-02  8.720e-03  11.333  < 2e-16 ***
## max_power                    7.378e-03  2.189e-04  33.711  < 2e-16 ***
## transmissionManual          -1.035e-01  1.661e-02  -6.228 5.09e-10 ***
## fuelPetrol                  -1.863e-01  9.775e-03 -19.057  < 2e-16 ***
## seller_typeIndividual       -6.535e-02  1.400e-02  -4.666 3.15e-06 ***
## seller_typeTrustmark Dealer  5.958e-02  5.989e-02   0.995     0.32    
## ownerFourth & Above Owner   -1.138e-01  2.880e-02  -3.952 7.86e-05 ***
## ownerSecond Owner           -8.023e-02  9.961e-03  -8.055 9.77e-16 ***
## ownerTest Drive Car          6.275e-01  1.317e-01   4.764 1.94e-06 ***
## ownerThird Owner            -1.157e-01  1.660e-02  -6.966 3.67e-12 ***
## engine                       2.606e-04  1.284e-05  20.286  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2912 on 5275 degrees of freedom
## Multiple R-squared:  0.8469, Adjusted R-squared:  0.8466 
## F-statistic:  2245 on 13 and 5275 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_small_add_model_log))
## [1] 14
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_small_add_model_log) / (1 - hatvalues(bic_small_add_model_log))) ^ 2))
## [1] 0.2916997
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_small_add_model_log,which=c(1,2))

# Diagnostic tests
bptest(bic_small_add_model_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_small_add_model_log
## BP = 390.86, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_small_add_model_log), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98479, p-value < 2.2e-16
# Multicollinearity test
vif(bic_small_add_model_log)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.426062  1        1.194179
## make_category 1.708071  2        1.143211
## max_power     2.984259  1        1.727501
## transmission  1.390403  1        1.179153
## fuel          1.469690  1        1.212308
## seller_type   1.109027  2        1.026208
## owner         1.390298  4        1.042050
## engine        2.512669  1        1.585140
Results of Small additive model with log transformation
  • There are 14 parameters in the small additive log-transformed model after BIC stepwise feature selection.

  • The small additive model after log transformation has R^2 = 0.8469, indicating that 84.69% of the variance in log(selling_price_in_10k) is explained by the predictors.

  • All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.

  • The Breusch-Pagan test rejects the null hypothesis (BP = 390.86, p-value < 2.2e-16), so the constant variance assumption is violated.

  • The Shapiro-Wilk test rejects the null hypothesis (W = 0.98479, p-value < 2.2e-16), although W close to 1 indicates the residuals are nearly normal.

  • The LOOCV RMSE of the model is 0.2916997, on the log scale, and so is comparable only with the other log-transformed models.

  • All the individual predictors are significant except for “seller_typeTrustmark Dealer” (p-value = 0.32).

ANOVA of Small and Full additive models (log transformed models)

anova(bic_small_add_model_log, bic_full_add_model_log)
## Analysis of Variance Table
## 
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission + 
##     fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type + 
##     transmission + owner + fuel_efficiency + engine + max_power + 
##     seats + make_category
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5275 447.40                                  
## 2   5272 433.03  3    14.366 58.301 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • From the above Analysis of Variance (ANOVA) result, the full additive model is statistically better than the small additive model, based on the high F statistic (58.301), the low p-value (< 2.2e-16), and the lower RSS.
  • The full additive model has a lower RSS (433.03) than the small additive model (447.40), indicating that the full additive model fits the training data better.
  • The low p-value indicates that the small additive model is NOT sufficient in explaining the variance of the response variable.
  • Hence we reject the null hypothesis H0: the small additive model is sufficient in explaining the variance of the response variable.
  • We also note that the full model is more complex, with 17 parameters, and is comparatively harder to interpret than the small model.
  • Hence we select the small model, since the difference in RSS is very small (14.37) and it is comparatively easier to interpret.

Influential points

indices_to_exclude = unname(which(cooks.distance(bic_small_add_model_log) > 4 / length(cooks.distance(bic_small_add_model_log))))
train_data_filtered = train_data[-indices_to_exclude, ]
#Number of excluded observations
nrow(train_data) - nrow(train_data_filtered)
## [1] 275
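The filter above uses the common 4/n rule of thumb on Cook's distance, which for observation $i$ measures how much all fitted values change when that observation is deleted:

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2}\cdot\frac{h_{ii}}{(1 - h_{ii})^2}$$

where $p$ is the number of parameters, $\hat{\sigma}^2$ the estimated residual variance, and $h_{ii}$ the leverage; observations with $D_i > 4/n$ are flagged as influential and excluded.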

Refitting the BIC Small additive model without influential points

n= nrow(train_data_filtered)
refitted_bic_small_add_model_log = lm(formula = formula(bic_small_add_model_log), data = train_data_filtered)

# Model summary
summary(refitted_bic_small_add_model_log)
## 
## Call:
## lm(formula = formula(bic_small_add_model_log), data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91499 -0.16399  0.01396  0.17626  0.87203 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -2.385e+02  2.351e+00 -101.419  < 2e-16 ***
## year                         1.198e-01  1.166e-03  102.774  < 2e-16 ***
## make_categoryLuxury          3.275e-01  3.318e-02    9.869  < 2e-16 ***
## make_categoryMidrange        7.794e-02  7.729e-03   10.084  < 2e-16 ***
## max_power                    7.895e-03  2.014e-04   39.200  < 2e-16 ***
## transmissionManual          -9.728e-02  1.499e-02   -6.491 9.37e-11 ***
## fuelPetrol                  -1.772e-01  8.716e-03  -20.333  < 2e-16 ***
## seller_typeIndividual       -7.352e-02  1.250e-02   -5.880 4.37e-09 ***
## seller_typeTrustmark Dealer  9.465e-03  7.136e-02    0.133  0.89449    
## ownerFourth & Above Owner   -9.419e-02  3.216e-02   -2.929  0.00342 ** 
## ownerSecond Owner           -6.530e-02  8.900e-03   -7.337 2.53e-13 ***
## ownerThird Owner            -9.013e-02  1.543e-02   -5.840 5.56e-09 ***
## engine                       2.717e-04  1.175e-05   23.120  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2529 on 5001 degrees of freedom
## Multiple R-squared:  0.8671, Adjusted R-squared:  0.8668 
## F-statistic:  2719 on 12 and 5001 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(refitted_bic_small_add_model_log))
## [1] 13
#Leave one out cross validation RMSE 
sqrt(mean((resid(refitted_bic_small_add_model_log) / (1 - hatvalues(refitted_bic_small_add_model_log))) ^ 2))
## [1] 0.2531371
# Diagnostic plots
par(mfrow = c(1, 2))
plot(refitted_bic_small_add_model_log,which=c(1,2))

# Diagnostic tests
bptest(refitted_bic_small_add_model_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  refitted_bic_small_add_model_log
## BP = 293.12, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(refitted_bic_small_add_model_log), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.99396, p-value = 1.123e-13
# Multicollinearity test
vif(refitted_bic_small_add_model_log)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.415566  1        1.189776
## make_category 1.522897  2        1.110881
## max_power     2.696754  1        1.642180
## transmission  1.280517  1        1.131599
## fuel          1.470022  1        1.212445
## seller_type   1.094271  2        1.022778
## owner         1.351841  3        1.051528
## engine        2.418486  1        1.555148
Results of refitted small additive model with log transformation | Without influential points
  • There are 13 parameters in the refitted additive model after BIC stepwise feature selection.

  • The refitted small additive model after log transformation has R^2 = 0.8671, indicating that 86.71% of the variance in selling_price_in_10k is explained by the predictors.

  • All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.

  • The Breusch-Pagan test fails with a test statistic of BP = 293.12 and a very low p-value (< 2.2e-16); the null hypothesis is rejected and the constant variance assumption is violated.

  • The Shapiro-Wilk normality test fails with a test statistic of W = 0.99396 and a very low p-value (1.123e-13); the null hypothesis is rejected and the normality assumption is violated.

  • The LOOCV RMSE of the model is 0.2531371, which is very low compared to the range of the response variable.

  • All individual predictors are significant except “seller_typeTrustmark Dealer” (p-value = 0.89449).

Weighted least squares approach

  • In all of the above models, the constant variance assumption is violated.
  • To resolve this, we can try a weighted least squares (WLS) approach.
  • In weighted least squares regression, observations with higher weights contribute more to the estimation of the regression coefficients, while observations with lower weights have less influence.
  • This approach is used when the residuals have non-constant variance or when certain observations are more reliable than others.
fitted = fitted(refitted_bic_small_add_model_log)
weights = 1 / (fitted^2)
weighted_model = lm(formula=formula(refitted_bic_small_add_model_log),data=train_data_filtered,weights=weights)

# Model summary
summary(weighted_model)
## 
## Call:
## lm(formula = formula(refitted_bic_small_add_model_log), data = train_data_filtered, 
##     weights = weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30968 -0.04312  0.00466  0.04837  0.33746 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -2.291e+02  1.994e+00 -114.912  < 2e-16 ***
## year                         1.152e-01  9.897e-04  116.376  < 2e-16 ***
## make_categoryLuxury          3.262e-01  4.715e-02    6.917 5.18e-12 ***
## make_categoryMidrange        7.087e-02  8.253e-03    8.588  < 2e-16 ***
## max_power                    8.081e-03  2.252e-04   35.876  < 2e-16 ***
## transmissionManual          -9.595e-02  1.809e-02   -5.303 1.19e-07 ***
## fuelPetrol                  -1.667e-01  9.262e-03  -17.993  < 2e-16 ***
## seller_typeIndividual       -8.198e-02  1.432e-02   -5.725 1.09e-08 ***
## seller_typeTrustmark Dealer  8.932e-03  8.664e-02    0.103    0.918    
## ownerFourth & Above Owner   -1.288e-01  2.536e-02   -5.079 3.93e-07 ***
## ownerSecond Owner           -7.539e-02  8.813e-03   -8.553  < 2e-16 ***
## ownerThird Owner            -9.645e-02  1.384e-02   -6.970 3.58e-12 ***
## engine                       2.626e-04  1.299e-05   20.219  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07434 on 5001 degrees of freedom
## Multiple R-squared:  0.8958, Adjusted R-squared:  0.8955 
## F-statistic:  3582 on 12 and 5001 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(weighted_model))
## [1] 13
#Leave one out cross validation RMSE 
sqrt(mean((resid(weighted_model) / (1 - hatvalues(weighted_model))) ^ 2))
## [1] 0.2536943
# Diagnostic plots
par(mfrow = c(1, 2))
plot(weighted_model,which=c(1,2))

# Diagnostic tests
bptest(weighted_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  weighted_model
## BP = 144.39, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(weighted_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.99306, p-value = 6.801e-15
# Multicollinearity test
vif(weighted_model)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.541822  1        1.241701
## make_category 1.419449  2        1.091516
## max_power     2.715823  1        1.647975
## transmission  1.205858  1        1.098116
## fuel          1.595998  1        1.263328
## seller_type   1.076706  2        1.018648
## owner         1.359096  3        1.052467
## engine        2.560109  1        1.600034
Results of weighted least squares model with log transformation | Without influential points
  • There are 13 parameters in the weighted model after BIC stepwise feature selection.

  • The weighted least squares model after log transformation has R^2 = 0.8958, indicating that 89.58% of the variance in selling_price_in_10k is explained by the predictors.

  • All predictors have a Variance Inflation Factor (VIF) < 5, indicating no multicollinearity among the predictors.

  • The Breusch-Pagan test fails with a test statistic of BP = 144.39 and a very low p-value (< 2.2e-16); the null hypothesis is rejected and the constant variance assumption is violated.

  • The Shapiro-Wilk normality test fails with a test statistic of W = 0.99306 and a very low p-value (6.801e-15); the null hypothesis is rejected and the normality assumption is violated.

  • The LOOCV RMSE of the model is 0.2536943, which is very low compared to the range of the response variable.

  • All individual predictors are significant except “seller_typeTrustmark Dealer” (p-value = 0.918).

Full Interaction model (log transformed) | with BIC

n=nrow(train_data)
bic_int_model_log = step(lm(selling_price_in_10k ~ (year + make_category + max_power + fuel + transmission + seller_type + owner + engine)^2, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_int_model_log)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power + 
##     fuel + transmission + owner + engine + year:make_category + 
##     year:max_power + year:fuel + year:engine + make_category:max_power + 
##     make_category:fuel + make_category:transmission + make_category:engine + 
##     max_power:fuel + max_power:engine + fuel:engine, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39319 -0.15527  0.02382  0.17903  1.09406 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -2.297e+02  9.750e+00 -23.563  < 2e-16
## year                                      1.152e-01  4.842e-03  23.790  < 2e-16
## make_categoryLuxury                       5.886e+01  1.773e+01   3.320 0.000906
## make_categoryMidrange                    -8.295e+00  4.934e+00  -1.681 0.092783
## max_power                                -1.100e+00  1.120e-01  -9.824  < 2e-16
## fuelPetrol                                5.786e+01  5.410e+00  10.694  < 2e-16
## transmissionManual                       -1.063e-01  2.319e-02  -4.586 4.63e-06
## ownerFourth & Above Owner                -9.922e-02  2.686e-02  -3.694 0.000223
## ownerSecond Owner                        -6.833e-02  9.293e-03  -7.352 2.25e-13
## ownerTest Drive Car                       4.369e-01  1.311e-01   3.333 0.000866
## ownerThird Owner                         -1.073e-01  1.547e-02  -6.939 4.42e-12
## engine                                    3.873e-02  7.375e-03   5.251 1.57e-07
## year:make_categoryLuxury                 -2.946e-02  8.786e-03  -3.353 0.000805
## year:make_categoryMidrange                3.877e-03  2.446e-03   1.585 0.113074
## year:max_power                            5.547e-04  5.558e-05   9.981  < 2e-16
## year:fuelPetrol                          -2.872e-02  2.684e-03 -10.698  < 2e-16
## year:engine                              -1.897e-05  3.664e-06  -5.177 2.34e-07
## make_categoryLuxury:max_power            -1.695e-03  1.238e-03  -1.370 0.170786
## make_categoryMidrange:max_power          -3.766e-03  4.658e-04  -8.086 7.58e-16
## make_categoryLuxury:fuelPetrol            1.324e-01  9.359e-02   1.414 0.157295
## make_categoryMidrange:fuelPetrol          1.899e-01  2.054e-02   9.245  < 2e-16
## make_categoryLuxury:transmissionManual   -3.519e-01  7.453e-02  -4.721 2.41e-06
## make_categoryMidrange:transmissionManual  1.562e-02  3.218e-02   0.485 0.627461
## make_categoryLuxury:engine                6.304e-04  1.085e-04   5.811 6.56e-09
## make_categoryMidrange:engine              5.636e-04  3.009e-05  18.731  < 2e-16
## max_power:fuelPetrol                      3.266e-03  8.282e-04   3.944 8.13e-05
## max_power:engine                         -4.588e-06  3.289e-07 -13.947  < 2e-16
## fuelPetrol:engine                        -4.266e-04  6.460e-05  -6.605 4.38e-11
##                                             
## (Intercept)                              ***
## year                                     ***
## make_categoryLuxury                      ***
## make_categoryMidrange                    .  
## max_power                                ***
## fuelPetrol                               ***
## transmissionManual                       ***
## ownerFourth & Above Owner                ***
## ownerSecond Owner                        ***
## ownerTest Drive Car                      ***
## ownerThird Owner                         ***
## engine                                   ***
## year:make_categoryLuxury                 ***
## year:make_categoryMidrange                  
## year:max_power                           ***
## year:fuelPetrol                          ***
## year:engine                              ***
## make_categoryLuxury:max_power               
## make_categoryMidrange:max_power          ***
## make_categoryLuxury:fuelPetrol              
## make_categoryMidrange:fuelPetrol         ***
## make_categoryLuxury:transmissionManual   ***
## make_categoryMidrange:transmissionManual    
## make_categoryLuxury:engine               ***
## make_categoryMidrange:engine             ***
## max_power:fuelPetrol                     ***
## max_power:engine                         ***
## fuelPetrol:engine                        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2714 on 5261 degrees of freedom
## Multiple R-squared:  0.8675, Adjusted R-squared:  0.8668 
## F-statistic:  1275 on 27 and 5261 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_int_model_log))
## [1] 28
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_int_model_log) / (1 - hatvalues(bic_int_model_log))) ^ 2))
## [1] 0.2724199
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model_log,which=c(1,2))

# Diagnostic tests
bptest(bic_int_model_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_int_model_log
## BP = 426.19, df = 27, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model_log), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98083, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model_log)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                                    GVIF Df GVIF^(1/(2*Df))
## year                       2.520117e+01  1        5.020076
## make_category              2.055362e+11  2      673.320845
## max_power                  8.999606e+05  1      948.662521
## fuel                       5.185422e+05  1      720.098743
## transmission               3.121457e+00  1        1.766765
## owner                      1.597883e+00  4        1.060335
## engine                     9.542579e+05  1      976.861272
## year:make_category         2.043151e+11  2      672.318594
## year:max_power             8.999602e+05  1      948.662324
## year:fuel                  5.174020e+05  1      719.306576
## year:engine                9.542519e+05  1      976.858189
## make_category:max_power    3.705888e+03  2        7.802308
## make_category:fuel         8.709786e+00  2        1.717916
## make_category:transmission 2.416656e+01  2        2.217194
## make_category:engine       3.782037e+03  2        7.842083
## max_power:fuel             8.622232e+01  1        9.285598
## max_power:engine           7.362371e+01  1        8.580426
## fuel:engine                1.080291e+02  1       10.393704
Results of interaction model after log transformation
  • There are 28 parameters in the interaction model after BIC stepwise feature selection.

  • The interaction model after log transformation has R^2 = 0.8675, indicating that 86.75% of the variance in selling_price_in_10k is explained by the predictors.

  • Many terms involving engine, fuel, year, make_category, and max_power have a Variance Inflation Factor (VIF) > 5, indicating multicollinearity among the predictors used in the interaction terms.

  • The Breusch-Pagan test fails with a test statistic of BP = 426.19 and a very low p-value (< 2.2e-16); the null hypothesis is rejected and the constant variance assumption is violated.

  • The Shapiro-Wilk normality test fails with a test statistic of W = 0.98083 and a very low p-value (< 2.2e-16); the null hypothesis is rejected and the normality assumption is violated.

  • The LOOCV RMSE of the model is 0.2724199, which is very low compared to the range of the response variable.

  • Most terms are significant, though a few interaction terms (e.g., year:make_categoryMidrange, make_categoryLuxury:max_power, make_categoryLuxury:fuelPetrol, make_categoryMidrange:transmissionManual) are not.

Results

Model Comparison

| Model | Description | Adjusted R^2 | Number of parameters | Breusch-Pagan test | Shapiro-Wilk test (with 5k sample residuals) | VIF > 5 | LOOCV RMSE |
|---|---|---|---|---|---|---|---|
| 1 | Full additive model | 0.7122 | 17 | Constant variance assumption violated (BP = 1310) | Normality assumption violated (W = 0.71385) | None | 27.30411 |
| 2 | Small additive model | 0.7097 | 14 | Constant variance assumption violated (BP = 1308.8) | Normality assumption violated (W = 0.70295) | None | 27.39708 |
| 3 | Interaction model | 0.726 | 17 | Constant variance assumption violated (BP = 1522.5) | Normality assumption violated (W = 0.73061) | transmission, seller_type, fuel, max_power | 26.73532 |
  • The Full additive model (Model 1) and the Small additive model (Model 2) performed similarly in terms of Adjusted R² (~0.71), but both violated the constant variance and normality assumptions.
  • The Interaction model (Model 3) improved the Adjusted R² to 0.726, but the variance inflation factors indicated multicollinearity among transmission, seller_type, fuel, and max_power, and the model still violated both assumptions.

The assumptions of constant variance and normality were significantly violated, suggesting that a log transformation of the response variable might improve the model.

Model Comparison after log transformation of response variable

| Model | Description | Adjusted R^2 | Number of parameters | Breusch-Pagan test | Shapiro-Wilk test (with 5k sample residuals) | VIF > 5 | LOOCV RMSE |
|---|---|---|---|---|---|---|---|
| 1 | Full additive model with BIC | 0.8518 | 17 | Constant variance assumption violated (BP = 387.08) | Normality assumption violated (W = 0.98726) | None | 0.2871987 |
| 2 | Small additive model with BIC | 0.8469 | 14 | Constant variance assumption violated (BP = 390.86) | Normality assumption violated (W = 0.98479) | None | 0.2916997 |
| 3 | Refitted small additive model without influential points | 0.8671 | 13 | Constant variance assumption violated (BP = 293.12) | Normality assumption violated (W = 0.99396) | None | 0.2531371 |
| 4 | Small additive BIC weighted model | 0.8958 | 13 | Constant variance assumption violated (BP = 144.39) | Normality assumption violated (W = 0.99306) | None | 0.2536943 |
| 5 | Full Interaction BIC model | 0.8675 | 28 | Constant variance assumption violated (BP = 426.19) | Normality assumption violated (W = 0.98083) | engine, fuel, year, make_category, max_power | 0.2724199 |
  • The Full additive model with BIC (Model 1) had an Adjusted R² of 0.8518, much higher than the models without transformation.
  • The Small additive model with BIC (Model 2) simplified the model to fewer parameters (14) with a similar Adjusted R² of 0.8469.
  • The Refitted small additive model without influential points (Model 3) further improved the Adjusted R² to 0.8671 but still violated the assumptions.
  • The Small additive BIC weighted model (Model 4) performed the best among all models, with:
    • The highest Adjusted R² (0.8958)
    • One of the smallest parameter counts (13)
    • The lowest Breusch-Pagan test statistic (BP = 144.39), indicating relatively better adherence to the constant variance assumption
    • All predictors with VIF < 5, indicating no multicollinearity among predictors
    • A low LOOCV RMSE (0.2536943), nearly matching the lowest (0.2531371)
  • The Full Interaction BIC model (Model 5) had similar performance but used significantly more parameters (28) and still violated the constant variance assumption.
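All of the LOOCV RMSE figures in the tables above come from the same hat-value shortcut. It can be wrapped in a small helper; this is a sketch using the built-in `mtcars` data for illustration, not the car-price data:

```r
# Closed-form leave-one-out RMSE for a fitted lm object:
# each residual is inflated by 1 / (1 - leverage) instead of refitting n times.
loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

# Illustration on a built-in dataset (not the project data).
demo_fit = lm(mpg ~ wt + hp, data = mtcars)
loocv_rmse(demo_fit)
```

The same helper could replace the repeated one-liners used for each model above.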

Final model

  • Across all of the analyzed models, the constant variance assumption is violated; this is not unusual for a real-world dataset.

  • Based on the analysis, the Small additive BIC weighted model (Model 4 after log transformation) was chosen as the final model for the following reasons:

    Model: lm(formula = selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner + engine, data = train_data_filtered, weights = weights)

    1. It achieves the highest Adjusted R² (0.8958), explaining nearly 90% of the variability in the log-transformed selling price of used cars.
    2. With only 13 parameters, the model is interpretable and less prone to overfitting.
    3. It has the lowest BP test statistic (144.39), suggesting relatively better adherence to the constant variance assumption.
    4. All predictors have VIF < 5, indicating no multicollinearity among predictors.
    5. Its LOOCV RMSE (0.2536943) is among the lowest of all models considered.

Test predictions

price_predictions_with_log = predict(weighted_model, newdata = test_data)
price_predictions = exp(price_predictions_with_log)
TEST_RMSE = sqrt(mean((test_data$selling_price_in_10k - price_predictions)^2))
TEST_RMSE
## [1] 18.64312
sst = sum((test_data$selling_price_in_10k - mean(test_data$selling_price_in_10k))^2)
sse = sum((test_data$selling_price_in_10k - price_predictions)^2)
TEST_R_SQRD = 1 - (sse / sst)
TEST_R_SQRD
## [1] 0.8731878

Diagnostic checks

test_residuals = test_data$selling_price_in_10k - price_predictions

plot(price_predictions, test_residuals,
     xlab = "Predicted Selling Price",
     ylab = "Residuals",
     main = "Residuals vs Predicted Values",
     pch = 16, col = "blue")
abline(h = 0, col = "red", lwd = 2)  

  • Similar to the training data, the constant variance assumption appears violated: as the fitted values increase, the residuals deviate further from zero.

  • However, the test R² value (0.8731878) indicates that 87.32% of the variance in selling_price_in_10k is explained by the predictors.

  • The root mean square error, TEST_RMSE = 18.64312, indicates a low error relative to the range of selling prices.

Comparison of predicted and actual values

plot(test_data$selling_price_in_10k, price_predictions,
     xlab = "Actual Price",
     ylab = "Predicted Price",
     main = "Predicted vs. Actual Prices")
abline(a = 0, b = 1, col = "blue")

  • The plot above shows the actual selling price of each car from the test data on the x-axis and the predicted price on the y-axis.
  • The blue diagonal line marks points where the actual price equals the predicted price.
  • Points close to the line indicate accurate predictions; the model performs well in the lower price range (0–100).
  • For prices above 200, the model does not capture the data well.

Discussion

The Small additive BIC weighted model highlights the key factors influencing used car prices:

Model: lm(formula = selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner + engine, data = train_data_filtered, weights = weights)

Year of manufacture: The positive coefficient on year indicates that newer cars command higher prices, with each additional model year associated with an increase in the log selling price.

Engine and max_power: Both have positive coefficients, so vehicles with larger engine capacity and higher maximum power sell for more.

Fuel type: The negative coefficient on fuelPetrol indicates that petrol vehicles sell for less than comparable diesel vehicles.

Transmission: The negative coefficient on transmissionManual indicates that manual-transmission cars are priced lower than automatics.

Other: Later ownership (second, third, fourth and above owners) and individual (non-dealer) sellers are associated with lower selling prices, while luxury and midrange makes command premiums over the baseline category.

The logarithmic transformation turned the positively skewed distribution of selling price into an approximately normal one, making it more suitable for model building and helping to stabilize the variance.
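This effect of the log transform can be illustrated on simulated right-skewed "prices" (a sketch on lognormal draws, not the actual dataset):

```r
set.seed(42)
# Simulated right-skewed "prices", mimicking the shape of selling_price.
price = rlnorm(5000, meanlog = 2, sdlog = 0.8)

# Simple moment-based skewness (avoids an extra package dependency).
skewness = function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(price)       # strongly positive (right-skewed)
skewness(log(price))  # close to zero after the log transform
```

The transformed variable is far closer to symmetric, which is what makes the normality and constant-variance diagnostics improve so markedly after the transformation.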

Recommendations

  1. The weighted least squares (WLS) model helped stabilize the variance of the residuals.
  2. Log-transforming the response variable reduced its skewness, making it more suitable for model building and helping to stabilize the variance.
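The effect behind recommendation 1 can be sketched on simulated data (not the project data): when the error variance grows with the mean response, inverse-variance weights of the same `1 / fitted^2` form used for the car-price model still recover the true coefficients while down-weighting the noisiest observations.

```r
set.seed(1)
n = 1000
x = runif(n, 1, 10)
# Heteroscedastic errors: the standard deviation grows with x.
y = 2 + 3 * x + rnorm(n, sd = 0.5 * x)

ols = lm(y ~ x)
# Inverse-variance weights, mirroring the 1 / fitted^2 weights used above.
wls = lm(y ~ x, weights = 1 / fitted(ols)^2)

coef(ols)
coef(wls)  # similar estimates, but more efficient under heteroscedasticity
```

Both fits estimate the slope near 3; the WLS fit simply places less trust in the high-variance observations.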

Future Enhancements

  1. Advanced machine learning models such as Random Forests or Gradient Boosting could be explored to build better predictive models.
  2. Incorporating additional features, such as the geographic location of the cars, could further enhance model performance.

Appendix